Synthea Health: Data Analysis and Insights Generation¶
!pip install -r requirements.txt
Exploratory Data Analysis¶
#import the required funcions and packages for EDA
import pandas as pd
from utils.utils import read_data_parse_dates,get_demographic_plot
import json
import plotly.express as px
config_file_path = './config/config.json'
# Load the JSON file
with open(config_file_path, 'r') as file:
config = json.load(file)
# Access the value of 'input_folder'
output_folder = config.get('cleaned_data_path')
print("output Folder:", output_folder)
output Folder: C://Users//veena.vemula//Documents//GitHub//SDE-Skills-Test//Output
!pip install -U kaleido
Requirement already satisfied: kaleido in c:\users\veena.vemula\anaconda3\lib\site-packages (0.2.1)
Read the cleaned and transformed files from Output folder¶
df_patients = read_data_parse_dates(output_folder,"patients_cleaned",["BIRTHDATE"])
df_encounters = read_data_parse_dates(output_folder,"encounters_cleaned",["START_ENC","STOP_ENC"])
df_conditions = read_data_parse_dates(output_folder,"conditions_cleaned",["START_DIAG","STOP_DIAG"])
df_medications = read_data_parse_dates(output_folder,"medications_cleaned",["START_MED","STOP_MED"])
df_procedures = read_data_parse_dates(output_folder,"procedures_cleaned",["START_PROC","STOP_PROC"])
## number of visits per patient
df_pat_cond = pd.merge(df_conditions, df_patients, on='PATIENT_ID', how='inner')
visit_count = df_pat_cond.groupby('PATIENT_ID')['ENCOUNTER'].nunique().reset_index()
result_df_sorted = visit_count.sort_values(by='ENCOUNTER',ascending=False).head(20)
fig = px.bar(result_df_sorted, x='PATIENT_ID', y='ENCOUNTER', title="Top 20 Patients by Visit Count",
labels={'name': 'Patient Name', 'visit_count': 'Number of Visits'},
color='ENCOUNTER', color_continuous_scale='Viridis')
fig.write_html("./output/data_visualization/top_patient_by_visit.html")
fig.to_image(format = "png")
fig.show()
Above visual - Visit count by patient dint gave any insights except highest vist of patient is 140 and lowest is as low as one to two over the span of 70 years . we will look in to visits by Age group
# Merge DataFrames and include age group
df_pat_cond = pd.merge(df_conditions, df_patients, on='PATIENT_ID', how='inner')
# Group by PATIENT_ID and AGE_GROUP, then count unique encounters
visit_count = df_pat_cond.groupby(['age_group'])['ENCOUNTER'].nunique().reset_index()
# Sort by visit count and extract top 20
visit_count_sorted = visit_count.sort_values(by='age_group', ascending=False)
top_20 = visit_count_sorted
# Calculate "Others" group for each age group
#remaining = visit_count_sorted.iloc[20:]
#others = remaining.groupby('age_group')['ENCOUNTER'].sum().reset_index()
#others['PATIENT_ID'] = 'Others'
# Combine top 20 and "Others"
result_df = top_20
# Create bar chart with color representing age groups
fig = px.bar(result_df,
x='age_group',
y='ENCOUNTER',
color='age_group',
title="Patients by Visit Count with Age Groups (Grouped with Others)",
labels={'PATIENT_ID': 'Patient ID', 'ENCOUNTER': 'Number of Visits', 'AGE_GROUP': 'Age Group'},
color_discrete_sequence=px.colors.qualitative.Vivid)
# Save and display
fig.write_html("./output/data_visualization/top_patient_by_visit_age_groups.html")
fig.to_image(format = "png")
fig.show()
Above chart infers 90+ age group has highest number of visits Age group 80-89 and 20-29 have almost same number of visit - decrease in number of vsits to age group 80-89 Rest all follows a natural trend without any spikes in data
df_pat_cond.head(5)
| START_DIAG | STOP_DIAG | PATIENT_ID | ENCOUNTER | CODE | DESCRIPTION | BIRTHDATE | DEATHDATE | SSN | DRIVERS | ... | CITY | STATE | COUNTY | ZIP | LAT | LON | HEALTHCARE_EXPENSES | HEALTHCARE_COVERAGE | age | age_group | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013-06-24 | 2013-07-02 | c1f1fcaa-82fd-d5b7-3544-c8f9708b06a8 | 0b2794bd-ec2b-d34f-0610-2523b3b7fcf0 | 10509002 | Acute bronchitis (disorder) | 2005-07-04 | NaN | 999-49-3323 | S99941126 | ... | Bellingham | Massachusetts | Norfolk County | NaN | 42.035213 | -71.482519 | 402723.415 | 14064.135 | 19.430137 | 10-19 |
| 1 | 2016-02-27 | 2016-03-14 | c1f1fcaa-82fd-d5b7-3544-c8f9708b06a8 | a6d818dd-0983-fd1c-eefa-3d2295532c45 | 283371005 | Laceration of forearm | 2005-07-04 | NaN | 999-49-3323 | S99941126 | ... | Bellingham | Massachusetts | Norfolk County | NaN | 42.035213 | -71.482519 | 402723.415 | 14064.135 | 19.430137 | 10-19 |
| 2 | 2016-08-11 | 2016-08-22 | c1f1fcaa-82fd-d5b7-3544-c8f9708b06a8 | 36d2e781-4655-0a11-1f70-c69856e02019 | 444814009 | Viral sinusitis (disorder) | 2005-07-04 | NaN | 999-49-3323 | S99941126 | ... | Bellingham | Massachusetts | Norfolk County | NaN | 42.035213 | -71.482519 | 402723.415 | 14064.135 | 19.430137 | 10-19 |
| 3 | 2016-11-27 | 2016-12-17 | c1f1fcaa-82fd-d5b7-3544-c8f9708b06a8 | c8eaaf41-958b-31ab-7de5-568cee8751f3 | 444814009 | Viral sinusitis (disorder) | 2005-07-04 | NaN | 999-49-3323 | S99941126 | ... | Bellingham | Massachusetts | Norfolk County | NaN | 42.035213 | -71.482519 | 402723.415 | 14064.135 | 19.430137 | 10-19 |
| 4 | 2017-02-22 | 2017-06-02 | c1f1fcaa-82fd-d5b7-3544-c8f9708b06a8 | 6474f606-5a1b-48c0-bbbf-ad6dcbc24d4e | 16114001 | Fracture of ankle | 2005-07-04 | NaN | 999-49-3323 | S99941126 | ... | Bellingham | Massachusetts | Norfolk County | NaN | 42.035213 | -71.482519 | 402723.415 | 14064.135 | 19.430137 | 10-19 |
5 rows × 32 columns
## frequency of certain diagnoses etc.
df_diag_freq = df_pat_cond[["PATIENT_ID","DESCRIPTION"]].drop_duplicates()["DESCRIPTION"].value_counts().reset_index().rename({"index":"Diagnosis","DESCRIPTION":"frequency"},axis=1).head(20)
fig = px.bar(df_diag_freq, x='frequency', y='count', title="Frequency of top 10 diagnosis",
labels={'name': 'Diagnosis', 'frequency': 'frequency'},
color='frequency', color_continuous_scale='Viridis')
fig.write_html("./output/data_visualization/top100diagfreq.html")
fig.to_image(format = "png")
fig.show()
From above graph we can infer Stress, Sinusitis , social isolation are high in number
# Age dist
fig = px.histogram(df_patients, x='age',
title="Age Distribution",
labels={'age': 'Age'},
# Number of bins for grouping ages
color_discrete_sequence=['blue'])
# Show the plot
fig.to_image(format = "png")
fig.show()
Age distribution follows normal distribution. The spread seems to be normal except there is unusual hike after 100 yeras
# Age Distribution by Gender
fig = px.histogram(
df_patients,
x='age', # Ensure the column name matches your DataFrame
title="Age Distribution by Gender",
labels={'AGE': 'Age', 'GENDER': 'Gender'},
nbins=20, # Adjust the number of bins for grouping ages
color='GENDER', # Differentiate by gender
color_discrete_sequence=px.colors.qualitative.Set2 # Use a color palette
)
# Update layout for better readability
fig.update_layout(
xaxis_title="age",
yaxis_title="Count",
legend_title="Gender",
barmode='group' # Group bars for better comparison
)
# Show the plot
fig.to_image(format = "png")
fig.show()
## lets calculate the age at each diag
df_pat_cond["age_diag"] = df_pat_cond["START_DIAG"].dt.year-df_pat_cond["BIRTHDATE"].dt.year
# Age dist
fig = px.histogram(df_pat_cond, x='age_diag',
title="Age at diagnosis Distribution",
labels={'age': 'age_diag'},
# Number of bins for grouping ages
color_discrete_sequence=['blue'])
# Show the plot
fig.write_html("./output/data_visualization/diag_age_distribution.html")
fig.to_image(format = "png")
fig.show()
df_pat_cond.columns
Index(['START_DIAG', 'STOP_DIAG', 'PATIENT_ID', 'ENCOUNTER', 'CODE',
'DESCRIPTION', 'BIRTHDATE', 'DEATHDATE', 'SSN', 'DRIVERS', 'PASSPORT',
'PREFIX', 'FIRST', 'LAST', 'SUFFIX', 'MAIDEN', 'MARITAL', 'RACE',
'ETHNICITY', 'GENDER', 'BIRTHPLACE', 'ADDRESS', 'CITY', 'STATE',
'COUNTY', 'ZIP', 'LAT', 'LON', 'HEALTHCARE_EXPENSES',
'HEALTHCARE_COVERAGE', 'age', 'age_group', 'age_diag'],
dtype='object')
# Filter diagnoses at age 20
df_pat_cond["age_diag"] = df_pat_cond["START_DIAG"].dt.year-df_pat_cond["BIRTHDATE"].dt.year
# Age dist
age_20_diag = df_pat_cond[df_pat_cond["age_diag"] == 20]
# Calculate frequency of diagnoses at age 20
age_20_diag_freq = (
age_20_diag["DESCRIPTION"]
.value_counts()
.reset_index()
.rename(columns={"index": "count", "DESCRIPTION": "Frequency"})
)
# Display the top diagnoses causing the spike
print(age_20_diag_freq.head(10)) # Top 10 diagnoses at age 20
# Optional: Visualize the diagnoses at age 20
fig = px.bar(
age_20_diag_freq.head(10),
x="count",
y="Frequency",
title="Top Diagnoses at Age 20",
labels={"Frequency": "Number of Cases", "Diagnosis": "Diagnosis"},
orientation="h",
color="Frequency",
color_continuous_scale="Viridis",
)
# Save and show the visualization
fig.write_html("./output/data_visualization/top_diagnoses_age_20.html")
fig.to_image(format = "png")
fig.show()
Frequency count 0 Stress (finding) 81 1 Limited social contact (finding) 39 2 Social isolation (finding) 34 3 Normal pregnancy 23 4 Not in labor force (finding) 16 5 Victim of intimate partner abuse (finding) 14 6 Viral sinusitis (disorder) 12 7 Anemia (disorder) 12 8 Lack of access to transportation (finding) 10 9 Acute bronchitis (disorder) 10
df_pat_cond.to_csv("./output/diagnosis_data.csv",index=False)
df_pat_cond.head(5)
#• The distribution of patients by age, gender
import plotly.graph_objects as go
fig = go.Figure()
# Add traces for Age, Gender, and Race
fig.add_trace(go.Histogram(x=df_patients['RACE'], name='Race', marker=dict(color='blue'), visible=True))
fig.add_trace(go.Histogram(x=df_patients['GENDER'], name='Gender', marker=dict(color='green'), visible=False))
fig.add_trace(go.Histogram(x=df_patients['age_group'], name='age_group', marker=dict(color='orange'), visible=False))
# Add dropdown menu
fig.update_layout(
updatemenus=[
dict(
type="dropdown",
direction="down",
buttons=[
dict(label="Race",
method="update",
args=[{"visible": [True, False, False]},
{"title": "Race Distribution"}]),
dict(label="Gender",
method="update",
args=[{"visible": [False, True, False]},
{"title": "Gender Distribution"}]),
dict(label="age_group",
method="update",
args=[{"visible": [False, False, True]},
{"title": "Age Distribution"}])
],
showactive=True
)
]
)
# Add layout details
fig.update_layout(
title="Distribution Viewer",
xaxis_title="Value",
yaxis_title="Count",
template="plotly_white"
)
fig.write_html("./output/data_visualization/distribution_gender_age_race.html")
fig.to_image(format = "png")
# Show the figure
fig.show()
fig = px.pie(df_patients, names='GENDER', title="Gender Distribution in Patient Data")
# Show the chart
fig.write_html("./output/data_visualization/distribution_gender_pie.html")
fig.to_image(format = "png")
fig.show()
#• The distribution of patients by diagonisis
##
df_diag_freq = df_pat_cond[["PATIENT_ID","DESCRIPTION"]].drop_duplicates()["DESCRIPTION"].value_counts().reset_index().rename({"count":"frequency","DESCRIPTION":"Diagnosis"},axis=1)
# Select only the top 10 diagnoses
df_diag_freq_top = df_diag_freq.head(10)
fig = px.bar(df_diag_freq_top, x='Diagnosis', y='frequency', title="distribution of patients by diagnosis",
labels={'name': 'Diagnosis', 'frequency': 'frequency'},
color='frequency', color_continuous_scale='Viridis')
fig.write_html("./output/data_visualization/distribution_diag.html")
fig.to_image(format = "png")
fig.show()
# Group by diagnosis and age group to get frequency
df_diag_age_group = (
df_pat_cond.groupby(["DESCRIPTION", "age_group"])["PATIENT_ID"]
.nunique()
.reset_index()
.rename(columns={"DESCRIPTION": "Diagnosis", "PATIENT_ID": "Frequency"})
)
# Select top 10 diagnoses
top_10_diagnoses = (
df_diag_age_group.groupby("Diagnosis")["Frequency"].sum()
.nlargest(10)
.index
)
df_diag_age_group_top = df_diag_age_group[df_diag_age_group["Diagnosis"].isin(top_10_diagnoses)]
# Create stacked bar chart
fig = px.bar(
df_diag_age_group_top,
x="Frequency",
y="Diagnosis",
color="age_group",
title="Top 10 Diagnoses by Age Group",
labels={"Frequency": "Number of Patients", "Diagnosis": "Diagnosis", "AGE_GROUP": "Age Group"},
orientation="h",
color_discrete_sequence=px.colors.qualitative.Set2
)
# Save and show the visualization
fig.write_html("./output/data_visualization/top10_diagnoses_age_group.html")
fig.to_image(format = "png")
fig.show()
#• The most frequent medications
##
df_med_freq = df_medications[["PATIENT_ID","DESCRIPTION"]].drop_duplicates()["DESCRIPTION"].value_counts().reset_index().rename({"count":"frequency","DESCRIPTION":"Medication"},axis=1).head(20)
fig = px.bar(df_med_freq, x='Medication', y='frequency', title="Frequency of top 100 medications",
labels={'name': 'Medication', 'frequency': 'frequency'},
color='frequency', color_continuous_scale='Viridis')
fig.write_html("./output/data_visualization/top100medicationfreq.html")
fig.to_image(format = "png")
fig.show()
#• Trends in medical procedures over time.
df_procedures["DESCRIPTION_SHORT"] = df_procedures['DESCRIPTION'].str.slice(0, 35)
procedure_trends = df_procedures.groupby(['Year', 'DESCRIPTION_SHORT']).size().reset_index(name='Count')
fig = px.line(procedure_trends,
x='Year',
y='Count',
color='DESCRIPTION_SHORT',
line_group='DESCRIPTION_SHORT',
title='Trends in Medical Procedures Over Time',
labels={'procedure_desc': 'Procedure', 'Count': 'Frequency'})
fig.write_html("./output/data_visualization/trend_procedure_time.html")
fig.to_image(format = "png")
fig.show()
## description is too long lets slice it
#1. WE clearly see that in early 60s/70s we were having less number of procedures known and implemented compared to 20s
#2. renal dialsis were not present untill 1990s
#3. when look at depression screening earlier in 90s they were in less number but now it has become common and more frequent.
#4. from the data or subset of data we can see that "chemotherapy" was not present earlier but 2000s it has been more frequent
# Filter procedures for the last 10 years
current_year = df_procedures['Year'].max()
last_10_years = current_year - 9
df_procedures_last_10_years = df_procedures[df_procedures['Year'] >= last_10_years]
# Shorten procedure descriptions for clarity
df_procedures_last_10_years["DESCRIPTION_SHORT"] = df_procedures_last_10_years['DESCRIPTION'].str.slice(0, 35)
# Group by year and procedure description
procedure_trends_last_10_years = (
df_procedures_last_10_years.groupby(['Year', 'DESCRIPTION_SHORT'])
.size()
.reset_index(name='Count')
)
# Create the line chart
fig = px.line(
procedure_trends_last_10_years,
x='Year',
y='Count',
color='DESCRIPTION_SHORT',
line_group='DESCRIPTION_SHORT',
title='Trends in Medical Procedures Over the Last 10 Years',
labels={'Year': 'Year', 'DESCRIPTION_SHORT': 'Procedure', 'Count': 'Frequency'}
)
# Save and show the visualization
fig.write_html("./output/data_visualization/trend_procedure_last_10_years.html")
fig.to_image(format = "png")
fig.show()
C:\Users\veena.vemula\AppData\Local\Temp\ipykernel_11420\536863353.py:7: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
# Filter procedures for the last 10 years
current_year = df_procedures['Year'].max()
last_10_years = current_year - 9
df_procedures_last_10_years = df_procedures[df_procedures['Year'] >= last_10_years]
# Shorten procedure descriptions for clarity
df_procedures_last_10_years["DESCRIPTION_SHORT"] = df_procedures_last_10_years['DESCRIPTION'].str.slice(0, 35)
# Calculate total counts for each procedure
total_procedure_counts = (
df_procedures_last_10_years.groupby("DESCRIPTION_SHORT").size().reset_index(name='Total_Count')
)
# Select the top 10 most frequent procedures
top_10_procedures = total_procedure_counts.nlargest(10, "Total_Count")["DESCRIPTION_SHORT"]
# Filter data for only the top 10 procedures
procedure_trends_top_10 = df_procedures_last_10_years[
df_procedures_last_10_years["DESCRIPTION_SHORT"].isin(top_10_procedures)
]
# Group by year and procedure description
procedure_trends_top_10_grouped = (
procedure_trends_top_10.groupby(['Year', 'DESCRIPTION_SHORT'])
.size()
.reset_index(name='Count')
)
# Create the line chart
fig = px.line(
procedure_trends_top_10_grouped,
x='Year',
y='Count',
color='DESCRIPTION_SHORT',
line_group='DESCRIPTION_SHORT',
title='Trends in Top 10 Medical Procedures Over the Last 10 Years',
labels={'Year': 'Year', 'DESCRIPTION_SHORT': 'Procedure', 'Count': 'Frequency'}
)
# Save and show the visualization
fig.write_html("./output/data_visualization/trend_top_10_procedures_last_10_years.html")
fig.to_image(format = "png")
fig.show()
C:\Users\veena.vemula\AppData\Local\Temp\ipykernel_11420\3787298815.py:7: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
• Explore the relation nship between patient demographics and diagnosis or medication types.¶
bins = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, float('inf')] # Add infinity for 90+
labels = [f"{bins[i]}-{bins[i + 1] - 1}" for i in range(len(bins) - 2)] + [f"{bins[-2]}+"]
df_pat_cond['age_diag_group'] = pd.cut(df_pat_cond['age_diag'], bins=bins, labels=labels, right=False)
df_pat_cond
demo_diag_data = df_pat_cond[["PATIENT_ID","age_diag_group","STATE","MARITAL","RACE","ETHNICITY","GENDER","DESCRIPTION"]].drop_duplicates()
fig = get_demographic_plot(demo_diag_data,"GENDER","DESCRIPTION","Distribution of Diagonisis by Gender")
fig.to_image(format = "png")
fig.show()
## Alcohalism,tobacco has seen more in Male, pregnancy and miscariage only in female which shows correcness of data
## neck injury more in female
# impcated morals seen more in female
##?? need to save this fig
fig = get_demographic_plot(demo_diag_data,"RACE","DESCRIPTION","Distribution of Diagonisis by RACE")
fig.to_image(format = "png")
fig.show()
## Anemia is seen more in white, obesity or high BMI seen more in black and white but less in asian
fig_age_grp_diag = get_demographic_plot(demo_diag_data,"age_diag_group","DESCRIPTION","Distribution of Diagonisis by age_group")
fig_age_grp_diag.write_html("./output/data_visualization/fig_age_grp_diag.html")
fig.to_image(format = "png")
fig_age_grp_diag.show()
#osteoporosis seen more in senior citizens
## hpertension seen more in 10-19 years
## normal pregnanc is more in 20-29 years and reduces furhter
C:\Users\veena.vemula\Documents\GitHub\SDE-Skills-Test\utils\utils.py:98: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
• Explore the relation nship between patient demographics medication types
df_pat_med = pd.merge(df_medications, df_patients, on='PATIENT_ID', how='inner')
df_pat_med["age_med"] = df_pat_med["START_MED"].dt.year-df_pat_med["BIRTHDATE"].dt.year ## calcualte age at the time of medicine given
bins = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, float('inf')] # Add infinity for 90+
labels = [f"{bins[i]}-{bins[i + 1] - 1}" for i in range(len(bins) - 2)] + [f"{bins[-2]}+"]
df_pat_med['age_med_group'] = pd.cut(df_pat_med['age_med'], bins=bins, labels=labels, right=False)
df_pat_med
demo_med_data = df_pat_med[["PATIENT_ID","age_group","STATE","MARITAL","RACE","ETHNICITY","GENDER","DESCRIPTION"]].drop_duplicates()
demo_med_data## removing duplicates on above combination
fig =get_demographic_plot(demo_med_data,"age_group","DESCRIPTION","Distribution of medication by age_group")
fig.to_image(format = "png")
fig.show()
fig = get_demographic_plot(demo_med_data,"GENDER","DESCRIPTION","Distribution of medication by GENDER")
fig.to_image(format = "png")
fig.show()
fig = get_demographic_plot(demo_med_data,"RACE","DESCRIPTION","Distribution of medication by RACE")
fig.to_image(format = "png")
fig.show()